This document is the summary of the R for Data Analysis workshop.
All correspondence related to this document should be addressed to:
Omid Ghasemi (Macquarie University, Sydney, NSW, 2109, AUSTRALIA)
Email: omidreza.ghasemi@hdr.mq.edu.au
Artwork by Allison Horst: https://github.com/allisonhorst/stats-illustrations
R can be used as a calculator. When an expression combines several operators, pay attention to the order in which R evaluates them, and use parentheses to make the intended order explicit.
10 + 10
## [1] 20
4 ^ 2
## [1] 16
(250 / 500) * 100
## [1] 50
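The point about evaluation order can be illustrated with a few expressions. This is a minimal sketch, and the variable names are hypothetical, chosen only for this illustration.

```r
# Multiplication and division are evaluated before addition and subtraction
no_brackets <- 2 + 3 * 4       # 3 * 4 is evaluated first, then 2 is added
with_brackets <- (2 + 3) * 4   # parentheses force the addition first

# Exponentiation binds more tightly than the unary minus
neg_square <- -2 ^ 2           # read as -(2 ^ 2)
square_of_neg <- (-2) ^ 2      # parentheses make the base negative

print(c(no_brackets, with_brackets, neg_square, square_of_neg))
```

When in doubt, adding parentheses costs nothing and makes the intent visible to the reader.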
R is flexible about spacing around operators, but spaces are not allowed inside the names of variables or functions.
10+10
## [1] 20
10 + 10
## [1] 20
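R can sometimes tell that you're not finished yet. If a line ends with an operator, the console shows a + prompt and waits for the rest of the expression: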
10 +
How do we create a variable? Assign a value to a name using <- or =. Note that R is case sensitive, so pay and Pay would be two different variables.
pay <- 250
month = 12
pay * month
## [1] 3000
salary <- pay * month
A few points on naming variables and vectors: use short, informative words and keep a consistent style (e.g., capital letters are allowed but not recommended, and words should be separated only by _ or . ).
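A short sketch of case sensitivity and consistent naming follows. The names used here (Pay, monthly_pay, months_worked, yearly_salary) are hypothetical and only illustrate one possible convention.

```r
pay <- 250
Pay <- 300                 # a different object from `pay`, because R is case sensitive
pay_differs <- pay != Pay  # TRUE: the two names hold different values

# One consistent style: lower-case words separated by underscores
monthly_pay <- 250
months_worked <- 12
yearly_salary <- monthly_pay * months_worked
print(yearly_salary)
```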
A function is a set of statements combined to perform a specific task. When we use a block of code repeatedly, we can convert it into a function. To write a function, you first need to define it:
my_multiplier <- function(a, b) {
  result <- a * b
  return(result)
}
Defining the function does not produce any output. To get a result, you need to call it:
my_multiplier (a=2, b=4)
## [1] 8
# or: my_multiplier (2, 4)
We can set a default value for our arguments:
my_multiplier2 <- function(a, b = 4) {
  result <- a * b
  return(result)
}
my_multiplier2 (a=2)
## [1] 8
# or: my_multiplier2(2)
# to override the default: my_multiplier2(2, 6), which returns 12
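As a further sketch of the same idea, the percentage calculation from the calculator section can be wrapped in a function with a default argument. The function name to_percent and its digits argument are hypothetical and are not part of the workshop material.

```r
# Hypothetical helper: express `part` as a percentage of `whole`
to_percent <- function(part, whole, digits = 1) {
  # Stop early with a readable message instead of returning Inf or NaN
  if (whole == 0) {
    stop("`whole` must not be zero")
  }
  round((part / whole) * 100, digits = digits)
}

to_percent(250, 500)           # uses the default digits = 1
to_percent(1, 3, digits = 3)   # overrides the default
```

Arguments with sensible defaults can be left out of the call, which keeps everyday calls short while still allowing finer control when needed.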
Fortunately, you do not need to write everything from scratch. R has lots of built-in functions that you can use:
round(54.6787)
## [1] 55
round(54.5787, digits = 2)
## [1] 54.58
Use ? before the function name to get some help. For example, ?round. You will see many functions in the rest of the workshop.
The class() function shows the type of a variable.
TRUE and FALSE can be abbreviated as T and F. They must be written in capital letters; ‘true’ is not a logical value:
class(TRUE)
## [1] "logical"
class(F)
## [1] "logical"
class(2)
## [1] "numeric"
class(13.46)
## [1] "numeric"
class("ha ha ha ha")
## [1] "character"
class("56.6")
## [1] "character"
class("TRUE")
## [1] "character"
Can we change the type of data in a variable? Yes, by using the as.---() family of functions, such as as.numeric() and as.character():
as.numeric(TRUE)
## [1] 1
as.character(4)
## [1] "4"
as.numeric("4.5")
## [1] 4.5
as.numeric("Hello")
## Warning: NAs introduced by coercion
## [1] NA
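Coercion also happens automatically when values of different types are combined. The sketch below shows this, together with one way to detect a failed conversion; the variable names are hypothetical.

```r
# A single character value forces every element to become character
mixed <- c(1, "a", TRUE)
class(mixed)

# Combined with numbers, logical values become 1 (TRUE) and 0 (FALSE)
num_and_logical <- c(1, TRUE, FALSE)
class(num_and_logical)

# Detect failed conversions with is.na() instead of relying on the warning
converted <- suppressWarnings(as.numeric(c("4.5", "Hello")))
is.na(converted)
```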
A vector stores more than one number or character string under a single name. Use the combine function c() to create one:
sale <- c(1, 2, 3,4, 5, 6, 7, 8, 9, 10) # also sale <- c(1:10)
sale <- c(1:10)
sale * sale
## [1] 1 4 9 16 25 36 49 64 81 100
Subsetting a vector:
days <- c("Saturday", "Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
days[2]
## [1] "Sunday"
days[-2]
## [1] "Saturday" "Monday" "Tuesday" "Wednesday" "Thursday" "Friday"
days[c(2, 3, 4)]
## [1] "Sunday" "Monday" "Tuesday"
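Besides positions, a logical condition can be used inside the brackets. The following sketch redefines days and sale so that it runs on its own; the result names are hypothetical.

```r
days <- c("Saturday", "Sunday", "Monday", "Tuesday",
          "Wednesday", "Thursday", "Friday")
sale <- 1:10

last_day <- days[length(days)]   # last element, whatever the vector length
big_sales <- sale[sale > 7]      # keep only the elements where the condition is TRUE
no_weekend <- days[!(days %in% c("Saturday", "Sunday"))]  # drop elements by value

print(last_day)
print(big_sales)
print(no_weekend)
```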
Create a vector named my_vector with the numbers from 0 to 1000 in it and calculate the mean, median, sd, min, max, and sum of that vector:
my_vector <- (0:1000)
mean(my_vector)
## [1] 500
median(my_vector)
## [1] 500
min(my_vector)
## [1] 0
range(my_vector)
## [1] 0 1000
class(my_vector)
## [1] "integer"
sum(my_vector)
## [1] 500500
sd(my_vector)
## [1] 289.1081
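The maximum can be obtained directly with max(), and summary() reports several statistics in one call. The sketch below also shows how these functions behave with missing values; the scores vector is hypothetical.

```r
my_vector <- 0:1000
max(my_vector)       # largest value in the vector
summary(my_vector)   # minimum, quartiles, mean, and maximum together

# Hypothetical vector with one missing value
scores <- c(10, NA, 30)
mean(scores)                 # NA, because one value is missing
mean(scores, na.rm = TRUE)   # the missing value is dropped before averaging
```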
A list allows you to gather a variety of objects under one name (that is, the name of the list) in an ordered way. These objects can be matrices, vectors, data frames, or even other lists.
my_list = list(sale, 1, 3, 4:7, "HELLO", "hello", FALSE)
my_list
## [[1]]
## [1] 1 2 3 4 5 6 7 8 9 10
##
## [[2]]
## [1] 1
##
## [[3]]
## [1] 3
##
## [[4]]
## [1] 4 5 6 7
##
## [[5]]
## [1] "HELLO"
##
## [[6]]
## [1] "hello"
##
## [[7]]
## [1] FALSE
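Elements of a list can be retrieved with double brackets, as the [[1]], [[2]], ... labels in the printed output suggest. This is a minimal sketch; named_list and its element names are hypothetical.

```r
sale <- 1:10
my_list <- list(sale, 1, 3, 4:7, "HELLO", "hello", FALSE)

fourth <- my_list[[4]]           # the element itself: an integer vector
fourth_as_list <- my_list[4]     # single brackets return a list of length one
second_sale <- my_list[[1]][2]   # second value of the first element

# Naming the elements allows access with $
named_list <- list(sales = sale, greeting = "hello")
named_list$greeting
```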
Factors store a vector together with the distinct values of its elements as labels (levels). The labels are always stored as character strings, regardless of whether the original values are numeric or character. For example, a variable gender with “male” and “female” entries:
gender <- c("male", "male", "male", " female", "female", "female")
gender <- factor(gender)
R now treats gender as a nominal (categorical) variable. Levels are coded internally in alphabetical order, so we would expect 1=female, 2=male:
summary(gender)
## female female male
## 1 2 3
gender
## [1] male male male female female female
## Levels: female female male
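So, be careful of spaces! The entry " female" (with a leading space) is treated as a separate level from "female", which is why three levels appear instead of two.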
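One possible way to guard against stray spaces is to strip them before creating the factor. The sketch below uses the base R function trimws(); the name gender_clean is hypothetical.

```r
gender <- c("male", "male", "male", " female", "female", "female")
nlevels(factor(gender))   # three levels, because " female" differs from "female"

# trimws() removes leading and trailing whitespace from each string
gender_clean <- factor(trimws(gender))
levels(gender_clean)
summary(gender_clean)
```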
You can build a longer vector without typing every entry by using the rep() function:
gender <- c(rep("male",30), rep("female", 40))
gender <- factor(gender)
gender
## [1] male male male male male male male male male male
## [11] male male male male male male male male male male
## [21] male male male male male male male male male male
## [31] female female female female female female female female female female
## [41] female female female female female female female female female female
## [51] female female female female female female female female female female
## [61] female female female female female female female female female female
## Levels: female male
There are two types of categorical variables: nominal and ordinal. How do we create ordered factors (when the variable is ordinal, so its values can be ranked)? We can add two arguments to the factor() function: ordered = TRUE, and levels = c("level1", "level2"). For example, we have a vector that shows participants’ education level. Here we set ordered = TRUE and then rename the levels with levels():
edu<-c(3,2,3,4,1,2,2,3,4)
education<-factor(edu, ordered = TRUE)
levels(education) <- c("Primary school","high school","College","Uni graduated")
education
## [1] College high school College Uni graduated Primary school
## [6] high school high school College Uni graduated
## Levels: Primary school < high school < College < Uni graduated
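The same ordered factor can also be built in a single factor() call by supplying levels and labels together. This is a sketch of that alternative, followed by a comparison that only makes sense for ordered factors; the name at_least_college is hypothetical.

```r
edu <- c(3, 2, 3, 4, 1, 2, 2, 3, 4)
education <- factor(edu,
                    levels = 1:4,
                    labels = c("Primary school", "high school",
                               "College", "Uni graduated"),
                    ordered = TRUE)

# Comparisons follow the order of the levels, not the alphabet
at_least_college <- education >= "College"
sum(at_least_college)   # number of participants at College level or above
table(education)        # counts per level, in level order
```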
Suppose we have a factor with patient and control values. Here, the first level is control and the second level is patient. Change the order of levels, so patient would be the first level:
health_status <- factor(c(rep('patient',5),rep('control',5)))
health_status
## [1] patient patient patient patient patient control control control control
## [10] control
## Levels: control patient
health_status_reordered <- factor(health_status, levels = c('patient','control'))
health_status_reordered
## [1] patient patient patient patient patient control control control control
## [10] control
## Levels: patient control
Finally, can you relabel both levels so that they start with a capital letter? (Hint: check ?factor)
health_status_relabeled <- factor(health_status, levels = c('patient','control'), labels = c('Patient','Control'))
health_status_relabeled
## [1] Patient Patient Patient Patient Patient Control Control Control Control
## [10] Control
## Levels: Patient Control
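When only the first (reference) level needs to change, the base R function relevel() offers a shorter route for unordered factors. This is a sketch of that alternative; the name health_status_releveled is hypothetical.

```r
health_status <- factor(c(rep("patient", 5), rep("control", 5)))

# Move "patient" to the front; the remaining levels keep their order
health_status_releveled <- relevel(health_status, ref = "patient")
levels(health_status_releveled)
```

The values themselves are unchanged; only the order of the levels differs.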
All elements of a matrix must have the same mode (numeric, character, etc.), and all columns must have the same length. A matrix can be created by passing a vector to the matrix() function, which fills the matrix column by column by default.
my_matrix = matrix(c(1,2,3,4,5,6,7,8,9), nrow = 3, ncol = 3)
my_matrix
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
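The fill direction and element access can be sketched as follows. The matrix names by_column and by_row are hypothetical.

```r
by_column <- matrix(1:9, nrow = 3, ncol = 3)              # default: fill column by column
by_row <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)   # fill row by row instead

dim(by_row)        # number of rows and columns
by_column[2, 3]    # row 2, column 3 of the column-filled matrix
by_row[2, 3]       # same position in the row-filled matrix
by_row[, 1]        # leaving the row index empty selects the whole first column
```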
Data frames can hold numeric, character or logical values. Within a column all elements have the same data type, but different columns can be of different data types. Let’s create a data frame:
id <- 1:200
group <- c(rep("Psychotherapy", 100), rep("Medication", 100))
response <- c(rnorm(100, mean = 30, sd = 5),
              rnorm(100, mean = 25, sd = 5))
my_dataframe <- data.frame(Patient = id,
                           Treatment = group,
                           Response = response)
We could also have created the same data frame in a single call:
my_dataframe <- data.frame(Patient = c(1:200),
                           Treatment = c(rep("Psychotherapy", 100), rep("Medication", 100)),
                           Response = c(rnorm(100, mean = 30, sd = 5),
                                        rnorm(100, mean = 25, sd = 5)))
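Because rnorm() draws random numbers, the Response values differ on every run. One way to make the data frame reproducible is to set a seed first. This is a minimal sketch; the seed value 2024 and the helper name make_dataframe are arbitrary, hypothetical choices.

```r
# Hypothetical helper that builds the data frame from a fixed seed
make_dataframe <- function(seed) {
  set.seed(seed)  # fixes the random number stream for the rnorm() calls below
  data.frame(Patient = 1:200,
             Treatment = rep(c("Psychotherapy", "Medication"), each = 100),
             Response = c(rnorm(100, mean = 30, sd = 5),
                          rnorm(100, mean = 25, sd = 5)))
}

df_a <- make_dataframe(2024)
df_b <- make_dataframe(2024)
identical(df_a, df_b)   # the same seed gives the same data
str(df_a)               # compact overview of columns and their types
```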
In large data sets, the function head() shows the first observations of a data frame (six by default). Similarly, the function tail() prints out the last observations in your data set.
head(my_dataframe)
| | Patient | Treatment | Response |
|---|---|---|---|
| 1 | 1 | Psychotherapy | 35.59862 |
| 2 | 2 | Psychotherapy | 36.62962 |
| 3 | 3 | Psychotherapy | 29.01446 |
| 4 | 4 | Psychotherapy | 22.71264 |
| 5 | 5 | Psychotherapy | 33.70909 |
| 6 | 6 | Psychotherapy | 35.58150 |
tail(my_dataframe)
| | Patient | Treatment | Response |
|---|---|---|---|
| 195 | 195 | Medication | 19.73908 |
| 196 | 196 | Medication | 23.20115 |
| 197 | 197 | Medication | 24.25496 |
| 198 | 198 | Medication | 31.60629 |
| 199 | 199 | Medication | 14.19786 |
| 200 | 200 | Medication | 33.20383 |
Similar to vectors and matrices, brackets [] are used to select data from rows and columns in data frames. The row index comes first and the column index second:
my_dataframe[35, 3]
## [1] 39.75112
my_dataframe[1:10, ]
| Patient | Treatment | Response |
|---|---|---|
| 1 | Psychotherapy | 35.59862 |
| 2 | Psychotherapy | 36.62962 |
| 3 | Psychotherapy | 29.01446 |
| 4 | Psychotherapy | 22.71264 |
| 5 | Psychotherapy | 33.70909 |
| 6 | Psychotherapy | 35.58150 |
| 7 | Psychotherapy | 26.91184 |
| 8 | Psychotherapy | 24.35061 |
| 9 | Psychotherapy | 30.09706 |
| 10 | Psychotherapy | 28.20006 |
How to get only the Response column for all participants?
my_dataframe[ , 3]
## [1] 35.59862 36.62962 29.01446 22.71264 33.70909 35.58150 26.91184 24.35061
## [9] 30.09706 28.20006 32.09037 25.15059 25.23804 35.73875 42.61423 32.24317
## [17] 24.60450 33.02065 36.51803 33.96189 43.34802 31.97126 31.66812 28.79656
## [25] 25.97128 33.38626 30.57732 23.27546 19.68985 28.17951 29.98973 27.16092
## [33] 31.99661 26.27210 39.75112 27.47717 16.54230 21.75016 33.36571 36.44900
## [41] 30.70574 29.09672 29.43279 22.77227 20.96891 36.73037 28.83073 31.67700
## [49] 34.54895 28.85963 27.09316 36.16230 24.35933 23.88968 34.71001 24.40981
## [57] 32.97373 33.39629 18.96927 35.96424 27.82566 21.47080 30.61677 32.20519
## [65] 35.46101 26.53701 39.78074 30.75453 20.34907 30.20975 36.76597 32.91871
## [73] 25.03069 35.65879 27.68528 36.40671 27.88243 28.90775 25.83936 41.69814
## [81] 44.44514 27.74542 28.13034 23.14361 20.74980 31.84590 36.95674 32.19458
## [89] 27.70576 34.45078 31.68835 29.76735 30.29011 30.04594 28.00025 34.60662
## [97] 25.64952 31.98999 29.15195 28.50349 24.16872 19.86797 19.86773 28.39305
## [105] 22.58449 21.74178 30.88721 21.84749 28.48663 22.40085 35.49097 28.02832
## [113] 31.35924 23.76777 21.67946 14.10546 22.86193 27.90598 24.61302 23.26970
## [121] 15.32896 30.49925 21.78486 23.99288 25.47913 34.21774 24.12370 26.81657
## [129] 29.65434 18.07262 28.98266 25.36519 31.33496 15.23519 24.89588 17.40844
## [137] 35.31352 19.03414 30.53285 20.96974 36.13117 25.31783 24.70286 24.30349
## [145] 23.77750 25.06979 16.42690 29.35791 34.99587 16.98754 26.94997 28.67525
## [153] 22.93114 18.99034 13.04770 27.22615 22.70794 27.14391 15.72608 28.32175
## [161] 17.88373 31.00352 20.02655 31.80458 29.67562 23.72706 30.19105 25.62265
## [169] 15.54247 19.61728 25.53547 28.17935 30.10152 35.65953 16.25308 19.49600
## [177] 23.54306 22.12626 22.47110 27.36396 28.10063 27.52071 27.52407 27.88606
## [185] 16.75499 29.03299 24.35035 30.20982 27.80097 18.65039 24.78246 24.88516
## [193] 26.85609 33.23869 19.73908 23.20115 24.25496 31.60629 14.19786 33.20383
An easier way to select particular columns is to use their names, which is more helpful than position numbers in large data sets:
my_dataframe[ , "Response"]
# OR:
my_dataframe$Response
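Column names also make it easy to select rows by a condition. The sketch below rebuilds a comparable data frame so that it runs on its own; the seed and the result names are hypothetical, and the Response values will not match those printed above.

```r
set.seed(1)  # arbitrary seed, only to make this sketch reproducible
my_dataframe <- data.frame(Patient = 1:200,
                           Treatment = rep(c("Psychotherapy", "Medication"), each = 100),
                           Response = c(rnorm(100, mean = 30, sd = 5),
                                        rnorm(100, mean = 25, sd = 5)))

# Rows where the condition is TRUE, all columns
medication_rows <- my_dataframe[my_dataframe$Treatment == "Medication", ]
nrow(medication_rows)

# Rows with a Response above 30, keeping only two named columns
high_responders <- my_dataframe[my_dataframe$Response > 30, c("Patient", "Response")]
head(high_responders)

# Mean Response within each Treatment group
group_means <- tapply(my_dataframe$Response, my_dataframe$Treatment, mean)
group_means
```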
So far, we have created data frames using the data.frame() function from base R. However, a better way to create data frames is to use the tibble() function from the tidyverse.